% Chapter Template

\chapter{Introduction} % Main chapter title

% \label{Intro} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}

% \lhead{Chapter 1. \emph{Introduction}} % Change X to a consecutive number; this is for the header on each page - perhaps a shortened title

%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------
\section{Problem Statement}

A protein can be viewed as a sequence of amino acid residues. In Bioinformatics, the purpose of protein sequence search against databases is to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the protein sequences.
Such a similarity search produces an alignment, i.e. seeks to align similar substrings of the sequences being compared.

Classical sequence alignment algorithms such as Needleman-Wunsch \citep{Needleman}, Smith-Waterman \citep{SW}, and the BLAST family of programs \citep{Altschul} have long been used for searching protein by performing pairwise alignment of each query against every sequence in the database, thus identifying those sequences in the database that are most closely related to various regions of the query.

Besides the above pairwise comparison algorithms, another paradigm compares a sequence to a probabilistic representation of several
proteins of the same family. Since all the sequences in a family are mostly similar to each other, it is possible to construct a common profile representing the \emph{consensus sequence}, which simply reflects the most commonly occurring residue at each position. One such probabilistic representation is called the \emph{profile HMM} (Hidden Markov Model) introduced by Anders Krogh and David Haussler \citep{Krogh}, which is a promising approach to improve the sensitivity of database-searching.

The profile HMM is a statistical model of a multiple sequence alignment, or even of a single sequence. The main strength of a profile HMM is that it is probabilistic finite state machine. This means that it assesses the probability of match, insert and delete at a given position of an alignment.

Unlike conventional pairwise comparisons, a consensus profile HMM model can in principle utilize additional statistical information, such as the position and identity of residues that are more or less conserved throughout the family, as well as variable insertion and deletion probabilities. By developing a statistical model that is based on known sequences in a protein family, a profile HMM can be used to model the protein sequence family.

In Bioinformatics, HMMER \citep{HMMER} is a free and commonly used software package for sequence analysis based on the profile HMM. 
Based on the strength of its underlying profile HMMs, HMMER aims to be significantly more accurate and more able to detect remote homologs, compared to BLAST, FASTA, and other sequence alignment tools and database search tools \citep{HMMER3}.

From 1992 to 1998, the HMMER1 series was developed by Sean Eddy. It includes a feature that is missing in HMMER2 and HMMER3: the \emph{hmmt} program for training HMMs from initially unaligned sequences and hence creating multiple alignments. The final stable version of HMMER1 was released as 1.8.5 in 2006. 

From 1998 to 2003, the HMMER2 series introduced the ``Plan 7" profile HMM architecture, which is still shared with HMMER3, and was the basic foundation for Pfam and other protein domain databases. It includes local and global alignment modes that HMMER3 lacks, because HMMER3 currently implements only fully local alignment. HMMER2 lacks DNA comparison that was present in HMMER1. The final stable version of HMMER2 was released as 2.3.2 in 2003.

In HMMER, the application \emph{hmmbuild} is used to build a profile HMM using a multiple sequence alignment, or single sequence as input. The application \emph{hmmsearch} is used to search a profile HMM against a sequence database, finding whether a sequence is member of the family described by the profile HMM. The \emph{hmmsearch} application outputs a ranked list of the sequences with the most significant matches to the profile. Another similar application in HMMER, \emph{hmmscan}, is the query of a single protein sequence of interest against a database of profile HMMs.

To compare a profile HMM with a protein sequence, HMMER uses the Viterbi algorithm detailed in subsection \ref{ViterbiSub}, which evaluates the path that has the maximum probability of the profile HMM generating the sequence. The Viterbi algorithm is a dynamic programming algorithm.
The fundamental task of the Viterbi algorithm for biological sequence alignment is to calculate three DP (Dynamic Programming) matrices: $M[~]$ for Match state, $I[~]$ for Insert state and $D[~]$ for Delete state. Each element value in the DP matrix is dependent on the value of previous element.

However, the widely used implementation of the Viterbi algorithm in HMMER2, has been slow and compute-intensive, on the order of more than 100 times slower than BLAST for a comparable search. In an era of enormous sequence databases, this speed disadvantage outweighs any advantage of the profile HMM method.

With the exponential growth of protein databases, there is an increasing demand for acceleration of such techniques. HMMER has been a target of many acceleration and optimization efforts. 

Specialized hardware architectures have been used to exploit coarse-grained parallelism in accelerating HMMER2. JackHMMer \citep{Wun} uses the Intel IXP 2850 network processor. \citep{Maddimsetty}, \citep{Derrien} and \citep{Oliver} use FPGAs (Field-Programmable Gate Arrays). Sachdeva et al. use the Cell Broadband Engine developed by IBM \citep{Sachdeva}.

On traditional CPU architecture, MPI-HMMER \citep{Walters2006} is a well-known and commonly used MPI implementation. In their studies, a single master node is used to assign multiple database blocks to worker nodes for computing in parallel and is responsible for collecting the results. Landman et al. exploit MPI and Intel SSE2 intrinsics to accelerate HMMER2 \citep{Landman}.

Graphics Processing Unit (GPU), as a specialized processor originally intended for manipulating computer graphics, has been shown to provide very attractive compute resources in addition to CPU and the above specialized hardware architectures, because of particular many-core parallel computation in a modern GPU.

Based on the GPUs produced by NVIDIA, NVIDIA introduced CUDA (Computer Unified Device Architecture) to facilitate general-purpose computing on GPUs. NVIDIA also developed the CUDA C/C++ compiler, libraries, and runtime software to enable programmers to access the GPU parallel computing capabilities.

ClawHMMer \citep{ClawHMMER} is the first GPU-enabled \emph{hmmsearch} implementation. Their implementation is based on the BrookGPU stream programming language, not the CUDA programming model. Since ClawHMMer, there has been several researches on accelerating HMMER for CUDA-enabled GPU. 
\citep{GPUHMM}, \citep{Ganesan}, \citep{Du} and \citep{Quirem} parallelized the Viterbi algorithm on CUDA-enabled GPUs.
However, these efforts have had limited impact on accelerating HMMER2 with speedups of only an order of magnitude.

In 2010, HMMER3.0 was released. It is the most significant acceleration of hmmsearch. The most significant difference between HMMER3 and HMMER2 is that HMMER3 uses a heuristic algorithm called the MSV filter, for Multiple (local, ungapped) Segment Viterbi, to accelerate profile HMM searches. By using the Intel SSE2 intrinsics to implement programs, HMMER3 is substantially more sensitive, and 100 to 1000 times faster than HMMER2 \citep{HMMER3}.

HMMER3.1 beta was released in 2013. It has several new features that did not make them into 3.0, including \emph{nhmmer} program for DNA homology searches with profile HMMs, the parallel search daemon \emph{hmmpgmd} program underlying HMMER Web Services, and a new HMM file format called 3/f format.

Although HMMER3 is much faster than HMMER2 and about as fast as BLAST for protein searches, it is still time-consuming. According to the CPU Benchmarks \citep{cpus} updated on 17th of June 2014, the Intel Core i7-3930K @ 3.20GHz ranks as the 2nd among `Common CPUs' and as the 23rd among `High End CPUs' in terms of performance. Our benchmark result in Section \ref{Pbench} shows that even run by this high power CPU, HMMER3 needs about 5 minutes to search a profile HMM with length 255 against the NCBI NR database.

\citep{Ahmed} uses Intel VTune Analyzer \citep{Intel} to investigate performance hotspot functions in HMMER3. Based on hotspot analysis, they study CUDA acceleration for three individual algorithms: Forward, Backward and Viterbi.

According to the implementation of HMMER3 detailed in Figure \ref{fig:hmmsearch}, the MSV, Viterbi, Forward and Backward algorithms are implemented in the so-called ``acceleration pipeline" at the core of the HMMER3 software package \citep{HMMER3}. The MSV algorithm is the first filter of the ``acceleration pipeline" and is the key hotspot of the whole process. Therefore, this thesis concentrates on porting the MSV filter onto a CUDA-enabled GPU to accelerate the \emph{hmmsearch} application.

\section{Research Objectives}
The objective of this research is to \emph{implement and accelerate hmmsearch of HMMER3 on a CUDA-enabled GPU}. Our implementation will be based on the HMMER3 MSV algorithm so that the result will be same as the \emph{hmmsearch} of HMMER3. Central to this goal is understanding the MSV algorithm and CUDA multi-thread parallel programming. 

In Bioinformatics, Smith-Waterman algorithm is also famous for sequence alignment using dynamic programming. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. Similar to the MSV algorithm, the Smith-Waterman algorithm is also fairly demanding of time: to align two sequences of lengths m and n, \emph{O}(mn) time is required. There has been a great deal of research on accelerating the Smith-Waterman algorithm on a CUDA-enabled GPU. From studying these research, we learn many optimization approaches to apply into our research.

\section{Research Contributions}
The contribution of this thesis can be classified as follows:
\begin{itemize}
 \item Analyze the core application \emph{hmmsearch} in HMMER3 and find the key hotspot, the MSV filter, for accelerating hmmsearch.
 \item Implement the protein sequence search tool \emph{cudaHmmsearch} on a CUDA-enabled GPU. Demonstrate many optimization approaches to accelerate cudaHmmsearch.
 \item Discuss and analyze the advantages and limitations of GPU hardware for CUDA parallel programming.
 \item Summarize the six steps for optimizing performance using CUDA programming.
\end{itemize}

\section{Organization of thesis}
The rest of this thesis is organized as follows:

Chapter \ref{Background} introduces the background necessary for understanding the work in this thesis.

Chapter \ref{CUDAHMMER3} presents the details of our \emph{cudaHmmsearch} implementation and optimization approaches. The six steps are summarized for better performance of CUDA programming at the end of this Chapter.

We performed comprehensive benchmarks which are presented and analyzed in Chapter \ref{Results}. 

The conclusion of Chapter \ref{Conclusions} summarizes our contributions, points out its limitations, and makes suggestions for future work.

\section{Typographical Conventions}
The following font conventions are used in this thesis:
\begin{itemize}
 \item {\fontfamily{phv}\fontseries{m}\selectfont Adobe Helvetica font}\\
 Used for code examples.
 \item {\fontfamily{phv}\fontseries{m}\selectfont \textsl{Adobe Helvetica slanted font}}\\
 Used for comments of code.
 \item {\fontfamily{pag}\selectfont Adobe AvantGarde font}\\
 Used for captions of table, figure and listing.
\end{itemize}
